A Detailed Proofs
A.1 Proof of Theorem 4.1
We can compute the fixed point of the recursion in Equation A.2 and obtain the following estimate; we then compare these two gaps. To utilize Eq. 4 for policy optimization, we follow the analysis in Section 3.2 of Kumar et al. By choosing different regularizers, one obtains a variety of instances within the CQL family; Eq. B.36 defines the instance called CFCQL(H), which is the update rule we use. In the discrete action space, we train a three-layer MLP network with an MLE loss. In the continuous action space, we use the explicit behavior-density estimation method of Wu et al.
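The discrete-action behavior-density training mentioned above (an MLP fit by maximum likelihood) can be sketched roughly as follows. This is a minimal sketch, not the paper's code: the layer sizes, learning rate, and synthetic data are illustrative assumptions, and the network here uses a single hidden layer rather than the exact architecture from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

class BehaviorMLP:
    """Behavior density mu(a|s) over discrete actions, trained by MLE."""
    def __init__(self, state_dim, n_actions, hidden=32, lr=0.05):
        self.W1 = rng.normal(0, 0.1, (state_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0, 0.1, (hidden, n_actions))
        self.b2 = np.zeros(n_actions)
        self.lr = lr

    def forward(self, S):
        self.h = np.maximum(0.0, S @ self.W1 + self.b1)  # ReLU hidden layer
        return softmax(self.h @ self.W2 + self.b2)       # mu(a|s)

    def mle_step(self, S, A):
        """One MLE update: minimize the batch negative log-likelihood."""
        P = self.forward(S)
        n = len(A)
        nll = -np.log(P[np.arange(n), A] + 1e-12).mean()
        # gradient of the mean NLL w.r.t. the logits: (P - onehot) / n
        G = P.copy()
        G[np.arange(n), A] -= 1.0
        G /= n
        dW2 = self.h.T @ G
        db2 = G.sum(axis=0)
        dh = (G @ self.W2.T) * (self.h > 0)   # backprop through ReLU
        dW1 = S.T @ dh
        db1 = dh.sum(axis=0)
        for p, g in [(self.W1, dW1), (self.b1, db1),
                     (self.W2, dW2), (self.b2, db2)]:
            p -= self.lr * g
        return nll

# Fit the model to a small synthetic offline dataset of (state, action) pairs.
S = rng.normal(size=(64, 4))
A = rng.integers(0, 3, size=64)
model = BehaviorMLP(state_dim=4, n_actions=3)
first = model.mle_step(S, A)
last = first
for _ in range(200):
    last = model.mle_step(S, A)
```

Repeated MLE steps drive the negative log-likelihood down from roughly ln(3) (the uniform-policy baseline for three actions) toward the empirical action frequencies in the dataset.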
A Mathematical Derivation
The intrinsic reward function of skill discovery is shown as follows. We compare our approach with multi-agent value decomposition methods (QMIX and QPLEX), role-based methods (ROMA and RODE), the diversity-based method CDS, and the skill-based method HSD. We develop our method on top of the Python MARL framework PyMARL, available on GitHub; the QMIX hyperparameters can be found in its source code. The remaining hyperparameters of our framework are listed in Table 2:
- Algorithm: discount factor 0.99; batch size 32; buffer size 5000; optimizer RMSprop; learning rate 0.0005; target-network update interval 200
- Agent network: temporal module GRU; hidden-state dimension of the temporal module 64
- Mixing network: embedding dimension 32; number of hypernetwork layers 2; hypernetwork embedding dimension 64
- HSL: skill-representation encoder embedding dimension 20; reward-decoder scaling factor 10; cosine-distance scaling factors 0.1 and 1; skill-representation learning training steps 50,000; skill-selector encoder embedding dimension 32; skill-selector decision interval 5
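As a compact restatement, the hyperparameters reported above can be gathered into a single configuration dict. The key names here are our own invention for illustration; only the values come from the text.

```python
# Illustrative config assembled from the reported hyperparameters
# (key names are hypothetical; values are those stated in the text).
config = {
    # Algorithm
    "discount_factor": 0.99,
    "batch_size": 32,
    "buffer_size": 5000,
    "optimizer": "RMSprop",
    "learning_rate": 0.0005,
    "target_update_interval": 200,
    # Agent network
    "agent_temporal_module": "GRU",
    "agent_hidden_dim": 64,
    # Mixing network
    "mixing_embed_dim": 32,
    "hypernet_layers": 2,
    "hypernet_embed_dim": 64,
    # HSL
    "skill_encoder_embed_dim": 20,
    "reward_decoder_scale": 10,
    "cosine_distance_scales": (0.1, 1),
    "skill_repr_training_steps": 50000,
    "skill_selector_embed_dim": 32,
    "skill_decision_interval": 5,
}
```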
- Oceania > Australia > New South Wales > Sydney (0.04)
- North America > Puerto Rico > San Juan > San Juan (0.04)
- Europe > France > Hauts-de-France > Nord > Lille (0.04)
- Asia > Japan > Honshū > Kansai > Osaka Prefecture > Osaka (0.04)
D2C-HRHR: Discrete Actions with Double Distributional Critics for High-Risk-High-Return Tasks
Zhang, Jundong, Situ, Yuhui, Zhang, Fanji, Deng, Rongji, Wei, Tianqi
Tasks involving high-risk-high-return (HRHR) actions, such as obstacle crossing, often exhibit multimodal action distributions and stochastic returns. Most reinforcement learning (RL) methods assume unimodal Gaussian policies and rely on scalar-valued critics, which limits their effectiveness in HRHR settings. We formally define HRHR tasks and theoretically show that Gaussian policies cannot guarantee convergence to the optimal solution. To address this, we propose a reinforcement learning framework that (i) discretizes continuous action spaces to approximate multimodal distributions, (ii) employs entropy-regularized exploration to improve coverage of risky but rewarding actions, and (iii) introduces a dual-critic architecture for more accurate discrete value distribution estimation. The framework scales to high-dimensional action spaces, supporting complex control domains. Experiments on locomotion and manipulation benchmarks with high risks of failure demonstrate that our method outperforms baselines, underscoring the importance of explicitly modeling multimodality and risk in RL.
- Asia > Middle East > Jordan (0.04)
- Asia > China > Guangdong Province > Guangzhou (0.04)
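The abstract's core idea, discretizing each continuous action dimension into bins so that a categorical distribution can represent multimodal action choices that a single Gaussian cannot, can be sketched as follows. The bin count, action range, and example distributions are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def make_bins(low, high, k):
    """Centers of k equal-width bins covering one action dimension."""
    edges = np.linspace(low, high, k + 1)
    return 0.5 * (edges[:-1] + edges[1:])

def entropy(p):
    """Shannon entropy of a categorical distribution (nats)."""
    p = np.asarray(p, dtype=float)
    return float(-(p * np.log(p + 1e-12)).sum())

centers = make_bins(-1.0, 1.0, 5)  # 5 bins on [-1, 1]

# A bimodal categorical puts mass on both risky extreme actions,
# which a unimodal Gaussian centered in between cannot represent.
bimodal = np.array([0.45, 0.05, 0.0, 0.05, 0.45])
unimodal = np.array([0.0, 0.1, 0.8, 0.1, 0.0])
```

Entropy-regularized exploration favors the higher-entropy bimodal distribution here, which keeps probability mass on both extreme (risky but potentially rewarding) bins instead of collapsing onto the safe middle one.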
Handoff Design in User-Centric Cell-Free Massive MIMO Networks Using DRL
Ammar, Hussein A., Adve, Raviraj, Shahbazpanahi, Shahram, Boudreau, Gary, Bahceci, Israfil
In the user-centric cell-free massive MIMO (UC-mMIMO) network scheme, user mobility necessitates updating the set of serving access points to maintain the user-centric clustering. Such updates are typically performed through handoff (HO) operations; however, frequent HOs incur overheads associated with the allocation and release of resources. This paper presents a deep reinforcement learning (DRL)-based solution to predict and manage these connections for mobile users. Our solution employs the Soft Actor-Critic algorithm, with a continuous action space representation, to train a deep neural network to serve as the HO policy. We present a novel reward function that integrates an HO penalty in order to balance the attainable rate against the overhead associated with HOs. We develop two variants of our system: the first uses mobility direction-assisted (DA) observations based on the user's movement pattern, while the second uses history-assisted (HA) observations based on the history of the large-scale fading (LSF). Simulation results show that our DRL-based continuous action space approach is more scalable than its discrete-space counterpart, and that the derived HO policy automatically learns to gather HOs in specific time slots to minimize the overhead of initiating HOs. Our solution can also operate in real time with a response time less than 0.
Index Terms--Mobility, handoff, handover, user-centric, cell-free massive MIMO, distributed MIMO, deep reinforcement learning, soft actor-critic, machine learning, channel aging.
User-centric cell-free massive MIMO (UC-mMIMO) is a wireless network architecture in which each user is served by a custom group of neighboring access points (APs) that are connected to a central unit (CU) via fronthaul links [1].
Unlike the current cellular system, which is based on macro base stations, UC-mMIMO deploys cooperative APs that jointly serve users without relying on traditional cellular boundaries. UC-mMIMO helps achieve reliable wireless connectivity and provides uniform performance throughout the network [1], [2]. However, this beyond-5G mobile wireless network architecture introduces the key challenge of determining the connections between the APs and the users as the users move through the network [3].
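The reward shape described in the abstract, attainable rate traded off against an HO penalty, can be sketched as follows. The function name, penalty weight, and numbers are illustrative assumptions, not the paper's exact formulation.

```python
def ho_reward(rate, n_handoffs, penalty=0.5):
    """Reward = attainable rate minus a per-handoff penalty.

    With this shape, the DRL agent is discouraged from refreshing its
    serving cluster unless the rate gain offsets the HO overhead.
    """
    return rate - penalty * n_handoffs

# Keeping the current cluster vs. switching two APs for a small rate gain:
keep = ho_reward(rate=3.0, n_handoffs=0)
switch = ho_reward(rate=3.2, n_handoffs=2)
```

Here the small rate improvement (3.2 vs. 3.0) does not cover the cost of two handoffs, so the penalized reward favors keeping the current cluster; this is the mechanism by which the learned policy gathers HOs into fewer, more worthwhile time slots.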
- North America > Canada > Ontario > Toronto (0.14)
- North America > Canada > Ontario > National Capital Region > Ottawa (0.14)
- North America > Canada > Ontario > Kingston (0.14)
- North America > Canada > Ontario > Durham Region > Oshawa (0.04)
- Telecommunications (1.00)
- Information Technology > Networks (0.66)
- Information Technology > Communications > Networks (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)